BMM-Based Chinese Word Segmentor with Word Support Model for the SIGHAN Bakeoff 2006
نویسنده
چکیده
This paper describes a Chinese word segmentor (CWS) for the third International Chinese Language Processing Bakeoff (SIGHAN Bakeoff 2006). We participate in the word segmentation task at the Microsoft Research (MSR) closed testing track. Our CWS is based on backward maximum matching with word support model (WSM) and contextual-based Chinese unknown word identification. From the scored results and our experimental results, it shows WSM can improve our previous CWS, which was reported at the SIGHAN Bakeoff 2005, about 1% of F-measure.
منابع مشابه
Report to BMM-based Chinese Word Segmentor with Context-based Unknown Word Identifier for the Second International Chinese Word Segmentation Bakeoff
This paper describes a Chinese word segmentor (CWS) based on backward maximum matching (BMM) technique for the 2 nd Chinese Word Segmentation Bakeoff in the Microsoft Research (MSR) closed testing track. Our CWS comprises of a context-based Chinese unknown word identifier (UWI). All the context-based knowledge for the UWI is fully automatically generated by the MSR training corpus. According to...
متن کاملPOC-NLW Template for Chinese Word Segmentation
In this paper, a language tagging template named POC-NLW (position of a character within an n-length word) is presented. Based on this template, a twostage statistical model for Chinese word segmentation is constructed. In this method, the basic word segmentation is based on n-gram language model, and a Hidden Markov tagger based on the POC-NLW template is used to implement the out-of-vocabular...
متن کاملA Pragmatic Chinese Word Segmentation System
This paper presents our work for participation in the Third International Chinese Word Segmentation Bakeoff. We apply several processing approaches according to the corresponding sub-tasks, which are exhibited in real natural language. In our system, Trigram model with smoothing algorithm is the core module in word segmentation, and Maximum Entropy model is the basic model in Named Entity Recog...
متن کاملWord Boundary Token Model for the SIGHAN Bakeoff 2007
This paper describes a Chinese word segmentation system based on word boundary token model and triple template matching model for extracting unknown words; and word support model for resolving segmentation ambiguity.
متن کاملCharacter Language Models for Chinese Word Segmentation and Named Entity Recognition
We describe the application of the LingPipe toolkit (Alias-i 2006) to Chinese word segmentation and named entity recognition. We provide results for the third SIGHAN Chinese language processing bakeoff (Levow 2006). F1 measures on the best performing corpora were .972 for word segmentation and .855 for person/location/organization named-
متن کامل